Unsupervised and Semi-supervised Myanmar Word Segmentation Approaches for Statistical Machine Translation

Authors

  • Ye Kyaw Thu
  • Andrew Finch
  • Eiichiro Sumita
  • Yoshinori Sagisaka
Abstract

In statistical machine translation (SMT), word segmentation is generally a necessary step for languages that do not naturally delimit words. For many low-resource languages there are no word segmentation tools, and research on word segmentation for these languages is scarce. In this paper, we study several plausible methods for Myanmar word segmentation for machine translation in order to shed light on promising avenues for future research. We propose a novel Bayesian learning method that can perform semi-supervised word segmentation with reference to a dictionary. We apply our method to machine translation involving the Myanmar language and compare it with the following segmentation approaches: human lexical/phrasal segmentation, character segmentation, syllable segmentation, purely unsupervised word segmentation, and maximum matching. We found that unsupervised segmentation was the most effective approach: it outperformed maximum matching, which in turn was better than syllable segmentation.
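Of the baselines listed above, maximum matching is the simplest to illustrate. The following is a minimal sketch of greedy forward maximum matching, not the authors' implementation: the function name `max_match`, the toy lexicon, and the Latin-script input are all assumptions made for illustration. It also matches over characters, whereas a Myanmar segmenter would more plausibly match over syllable units.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching (illustrative sketch).

    At each position, take the longest substring found in `lexicon`;
    if nothing matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest dictionary entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)

    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking one unit at a time.
        for j in range(min(len(text), i + max_len), i, -1):
            # A single character is always accepted as a fallback,
            # so the loop is guaranteed to advance.
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"the", "then", "cat", "at"}
print(max_match("thenat", toy_lexicon))
print(max_match("xcat", toy_lexicon))
```

Because the method is purely dictionary-driven, material not covered by the dictionary falls back to single units.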

Similar resources

Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation

Words in Chinese text are not naturally separated by delimiters, which poses a challenge to standard machine translation (MT) systems. In MT, the widely used approach is to apply a Chinese word segmenter trained from manually annotated data, using a fixed lexicon. Such word segmentation is not necessarily optimal for translation. We propose a Bayesian semi-supervised Chinese word segmentation m...

Unsupervised Translation Disambiguation for Cross-Domain Statistical Machine Translation

Most attempts at integrating word sense disambiguation with statistical machine translation have focused on supervised disambiguation approaches. These approaches are of limited use when the distribution of the test data differs strongly from that of the training data; however, word sense errors tend to be especially common under these conditions. In this paper we present different approaches t...

Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models

This paper describes an unsupervised dynamic graphical model for morphological segmentation and bilingual morpheme alignment for statistical machine translation. The model extends Hidden Semi-Markov chain models by using factored output nodes and special structures for its conditional probability distributions. It relies on morpho-syntactic and lexical source-side information (part-of-speech, m...

Tuning Syntactically Enhanced Word Alignment for Statistical Machine Translation

We introduce a syntactically enhanced word alignment model that is more flexible than state-of-the-art generative word alignment models and can be tuned according to different end tasks. First of all, this model takes the advantages of both unsupervised and supervised word alignment approaches by obtaining anchor alignments from unsupervised generative models and seeding the anchor alignments i...

Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches of explicitly modeling the probabilities of words through Dirichlet process (DP) models or Pitman-Yor process (PYP) models have achieved high accuracy, but their ...

Publication date: 2013